The Great Leap: From Math to Experience
In the realm of Dynamic Programming, we were like gods with a map; we knew the exact probability $p(s', r | s, a)$ of every possible wind gust and terrain shift. But the real world rarely provides such a map. Monte Carlo (MC) methods represent a fundamental shift in philosophy: we stop calculating expectations over models and start learning from sampled experience.
The Mechanics of Sampling
MC methods still follow the interplay of policy evaluation and policy improvement within the Generalized Policy Iteration (GPI) framework. Instead of bootstrapping from the value estimates of neighboring states, we run an episode all the way to termination and compute the actual return $G_t$.
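The backward pass for computing returns can be sketched as follows. This is a minimal illustration, assuming rewards are stored as a list and indexed so that `rewards[t]` is the reward received after leaving step `t`; the function name and example values are hypothetical:

```python
def compute_returns(rewards, gamma=0.9):
    """Work backward from the terminal step: G_t = r_{t+1} + gamma * G_{t+1}."""
    G = 0.0
    returns = []
    for r in reversed(rewards):
        G = r + gamma * G
        returns.append(G)
    returns.reverse()  # restore chronological order
    return returns

print(compute_returns([0, 0, 1], gamma=0.5))  # → [0.25, 0.5, 1.0]
```

Working backward makes each $G_t$ a single addition and multiplication, rather than re-summing the discounted tail at every step.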
- First-visit MC: We average only the return following the first time a state is encountered in each episode.
- Every-visit MC: We average the returns following every encounter with the state. Both converge to $V_\pi$ as the number of visits grows without bound.
- No Bootstrapping: Because each estimate is built from complete, independent returns rather than from other estimates, MC is less sensitive to violations of the Markov property, where the current state representation hides relevant history.
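Putting the pieces together, here is a minimal sketch of first-visit MC prediction. The toy two-state chain (`A -> B -> terminal`), its reward values, and all function names are illustrative assumptions, not from the text:

```python
import random
from collections import defaultdict

def first_visit_mc(sample_episode, num_episodes=5000, gamma=1.0):
    """First-visit MC prediction: estimate V by averaging the return
    that follows the first occurrence of each state per episode."""
    returns_sum = defaultdict(float)
    returns_count = defaultdict(int)
    for _ in range(num_episodes):
        episode = sample_episode()  # list of (state, reward) pairs
        # Compute G_t for every step by working backward from the end.
        G = 0.0
        returns = [0.0] * len(episode)
        for t in reversed(range(len(episode))):
            G = episode[t][1] + gamma * G
            returns[t] = G
        # Credit only the first visit to each state in this episode.
        seen = set()
        for t, (s, _) in enumerate(episode):
            if s not in seen:
                seen.add(s)
                returns_sum[s] += returns[t]
                returns_count[s] += 1
    return {s: returns_sum[s] / returns_count[s] for s in returns_sum}

def sample_episode():
    # Hypothetical chain: A -> B -> terminal. Leaving A yields reward 0;
    # leaving B yields 0 or 2 with equal probability (mean 1).
    return [("A", 0), ("B", random.choice([0, 2]))]

V = first_visit_mc(sample_episode, num_episodes=20000, gamma=0.9)
# With enough episodes, V["B"] approaches 1.0 and V["A"] approaches 0.9.
```

Switching the `seen` check off (crediting every occurrence) would turn this into every-visit MC; for this toy chain, where no state repeats within an episode, the two variants coincide.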